Matching and Maximizing? A neurally plausible model of stochastic reinforcement learning
Authors
Abstract
An influential model of how reinforcement learning occurs in human brains is the one pioneered by Suri and Schultz [6]. The model was originally designed and tested for learning tasks in deterministic environments. This paper investigates if and how the model can be extended to also apply to learning in stochastic environments. It is known that when rewards are probabilistically coupled to actions, humans tend to display a suboptimal type of behavior, called matching, in which the probability of selecting a given response equals the probability of a reward for that response. Animal experiments suggest that humans are unique in this respect: non-human animals display an optimal type of behavior, called maximizing, in which responses with the maximum probability of a reward are consistently selected. We first show that the model in its original form becomes inert when confronted with a stochastic environment. We then consider two natural adjustments to the model and observe that one of them leads to matching behavior and the other leads to maximizing behavior. The results yield a deeper insight into the workings of the model and may provide a basis for a better understanding of the learning mechanisms implemented by human and animal brains.

When we enter the world as infants there are many things we cannot do yet, but as the years pass our behavioral repertoire increases vastly. For example, we learn to sit, to walk, to talk, to read, and so on. Understanding the human ability for learning seems crucial for understanding how we come to display all kinds of adaptive and intelligent behaviors. Such an understanding can also be put to practical use in the context of artificial intelligence, as it may afford building machines that can learn all that humans can learn. In this paper we focus on a specific, yet common, form of learning called reinforcement learning. Reinforcement learning is a type of learning in which the learner is given minimal information about his or her performance on the task that has to be learned. Feedback is given on performed actions, but no feedback is given about what feedback other actions would have yielded. In the task used in this paper the only feedback given is whether the chosen action was correct or not.

In 1998, Schultz discovered a systematic relationship between the activity of dopamine neurons and reinforcement learning [5]. Soon after, Suri and Schultz used these insights to propose a biologically inspired model of reinforcement learning [6]. These authors were among the first to propose a model of this type, but see also for example [2]. Even though many advances have been made in this field since, the model of Suri and Schultz incorporated many of the general principles that are still used to date. Also, the model is less complicated than many of its successors. These two properties make the model well suited for testing the essence of this class of models. Suri and Schultz [6] trained their model on a deterministic task with delayed rewards. Not only was the model capable of learning to perform the task, but Suri and Schultz observed that the model followed a learning curve similar to that of monkeys, and activations of key components in the model qualitatively matched the pattern of neural activity in the monkey's basal ganglia. These results are impressive and show that the neurophysiological mechanisms underlying reinforcement learning in the brain can be captured in computational models.
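The class of models to which the Suri and Schultz model belongs is organized around a dopamine-like prediction-error signal that updates reward predictions after every trial. The sketch below illustrates only that general principle in a minimal trial-by-trial learner; it is not the authors' architecture (which, among other things, handles delayed rewards via stimulus traces), and the function names and parameter values (learning rate, softmax temperature) are illustrative assumptions.

```python
import math
import random

# Minimal sketch of a prediction-error-driven learner, illustrating the
# general principle behind this class of models. This is NOT Suri and
# Schultz's exact architecture; names and parameter values are assumptions.

ALPHA = 0.1      # learning rate (assumed value)
N_ACTIONS = 2

def softmax_choice(values, temperature=1.0):
    """Pick an action with probability proportional to exp(value / temperature)."""
    exps = [math.exp(v / temperature) for v in values]
    r = random.random() * sum(exps)
    acc = 0.0
    for action, e in enumerate(exps):
        acc += e
        if r <= acc:
            return action
    return N_ACTIONS - 1

def run_session(reward_fn, n_trials=1000):
    """Train on a task defined by reward_fn(action) -> 0 or 1.

    In a deterministic task reward_fn always returns the same feedback for a
    given action; in a stochastic task it returns 1 with probability 0 < p < 1.
    """
    values = [0.0] * N_ACTIONS             # predicted reward per action
    for _ in range(n_trials):
        action = softmax_choice(values)
        reward = reward_fn(action)
        delta = reward - values[action]    # dopamine-like prediction error
        values[action] += ALPHA * delta    # move prediction toward outcome
    return values
```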
One important limitation, however, is that the results were attained specifically using a deterministic task (i.e., a task in which an action is guaranteed to yield the same feedback every time it is performed). The real world is inherently uncertain: almost every action we can perform will sometimes be successful and sometimes not. A stronger test of the model, and of its ecological validity, can therefore be achieved by testing its performance on a stochastic task, in which each action has some probability 0 < p < 1 of being successful. In this situation it is impossible to select actions that are always successful. In the literature, two qualitatively different strategies have been reported, one specifically associated with human performance and the other with animal performance [1, 3, 7, 8]. Humans tend to use a suboptimal strategy, called matching, which consists of selecting a given action with a probability that equals the probability of a reward for that action. In contrast, non-human animals tend to use an optimal strategy, called maximizing, which consists of always selecting the action with the maximum probability of a reward. Given this characteristic difference between humans and non-human animals in how they perform on a stochastic task, it is of interest to investigate whether the model of Suri and Schultz can capture either of these strategies, or possibly both. By investigating which settings cause the model to exhibit these strategies we may gain a deeper understanding of the reinforcement learning mechanisms operational in human and animal brains.

The remainder of this paper is organized as follows. We start, in Section 1, with a description of the model. Section 2 presents details of the task and the strategies used by humans and animals. Next we report on the results of our simulations. We show how, and explain why, the model in its original form becomes inert when trained on the stochastic task (Section 3). We then consider two classes of adaptations to the model, both of which suffice to overcome this initial inertia. We observe that the first class of adaptations causes the model to display the matching strategy (Section 4.1), and that the second class causes the model to display the maximizing strategy (Section 4.2). We conclude by discussing the significance and implications of our findings in Section 5.
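To make the difference between the two strategies concrete, consider a toy two-alternative task in which one action is rewarded with probability 0.7 and the other with probability 0.3 (illustrative values, not the task parameters used in the paper). Matching then earns on average 0.7 × 0.7 + 0.3 × 0.3 = 0.58 reward per trial, whereas maximizing earns 0.70. The simulation sketch below, under those assumed probabilities, makes this explicit.

```python
import random

# Toy two-alternative stochastic task contrasting matching and maximizing.
# The reward probabilities are illustrative assumptions.
P_REWARD = [0.7, 0.3]

def matching_choice():
    # Matching: select each action with probability equal to its reward probability.
    return 0 if random.random() < P_REWARD[0] else 1

def maximizing_choice():
    # Maximizing: always select the action with the highest reward probability.
    return max(range(len(P_REWARD)), key=lambda a: P_REWARD[a])

def mean_reward(strategy, n_trials=100_000):
    hits = 0
    for _ in range(n_trials):
        action = strategy()
        if random.random() < P_REWARD[action]:
            hits += 1
    return hits / n_trials

if __name__ == "__main__":
    # Expected output: roughly 0.58 for matching and roughly 0.70 for maximizing.
    print("matching:  ", mean_reward(matching_choice))
    print("maximizing:", mean_reward(maximizing_choice))
```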
Similar Articles
SHRUTI-agent: a structured connectionist model of decision-making
A neurally plausible connectionist model of decision-making, based on the SHRUTI architecture, is being developed. Toward this end, issues of appropriate connectionist representations for belief and utility, necessary control mechanisms, and reinforcement-based learning are addressed.
Hebbian learning for deciding optimally among many alternatives (almost)
Reward-maximizing performance and neurally plausible mechanisms for achieving it have been completely characterized for a general class of two-alternative decision making tasks, and data suggest that humans can implement the optimal procedure. A greater number of alternatives complicates the analysis, but here too, analytical approximations to optimality that are physically and psychologically ...
A More General Model of Cooperation Based on Reinforcement Learning: Alignment and Integration of the Bush-Mosteller and the Stochastic Collusion and the Power Law of Learning: Aligning and Integrating the Bush-Mosteller and the Roth-Erev Reinforcement Learning Models of Cooperation
Analytical game theory has developed the Nash equilibrium as a theoretical tool for the analysis of cooperation and conflicts in interdependent decision making. Indeterminacy and demanding rationality assumptions of the Nash equilibrium have led cognitive game theorists to explore learning-theoretic models of behavior. Two prominent examples are the Bush-Mosteller stochastic learning model and th...
Neural mechanism for stochastic behaviour during a competitive game
Previous studies have shown that non-human primates can generate highly stochastic choice behaviour, especially when this is required during a competitive interaction with another agent. To understand the neural mechanism of such dynamic choice behaviour, we propose a biologically plausible model of decision making endowed with synaptic plasticity that follows a reward-dependent stochastic Hebb...
A Spiking Neural Network Model of Model-Free Reinforcement Learning with High-Dimensional Sensory Input and Perceptual Ambiguity
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the...